Optical character recognition (OCR) transforms text images into machine-readable text, but the output is unstructured, so the data must be classified to extract knowledge from it and to improve document retrieval. Building on that foundation, the company Tradeshift created a machine-learning product that classifies the text blocks in a document as dates, addresses, names, and so on, enriching the output of the OCR process. The organization decided to host a competition on Kaggle, a data science platform, opening its data so the community could try to beat its machine learning model on this classification problem. The competition and the dataset can be accessed through this link and are available to any Kaggle member who, like me, intends to beat their benchmark.
In this competition, we have to create a supervised machine learning algorithm that predicts the probability that a block of text carries a particular label; a block can have multiple labels. For every document, words are detected and combined to form text blocks that may overlap each other. Each text block is enclosed within a spatial box, depicted by a red line in the sketch below. The text blocks from all documents are aggregated in a dataset where each text block corresponds to one sample (row). The text comes from the OCR step, and the host organization provides features such as the hashed content of the text; the position and size of the box; whether the text can be parsed as a date or as a number; and information about the surrounding text blocks in the original document. The final classifier is intended to beat Tradeshift's benchmark; some of the tasks involved in reaching that goal are:
The evaluation metric chosen by the organizers for this competition is the negative logarithm of the likelihood function averaged over the $N_t$ test samples and $K$ labels, as shown by the following equation:
$$\textrm{LogLoss} = \frac{1}{N_{t} \cdot K} \sum_{idx=1}^{N_{t} \cdot K} \textrm{LogLoss}_{idx}$$ $$= \frac{1}{N_{t} \cdot K} \sum_{idx=1}^{N_{t} \cdot K} \left[ - y_{idx} \log(\hat{y}_{idx}) - (1 - y_{idx}) \log(1 - \hat{y}_{idx})\right]$$ $$= \frac{1}{N_{t} \cdot K} \sum_{i=1}^{N_{t}} \sum_{j=1}^K \left[ - y_{ij} \log(\hat{y}_{ij}) - (1 - y_{ij}) \log(1 - \hat{y}_{ij})\right]$$
This function penalizes predictions that are confident and wrong: in the worst case, predicting true (1) for a false label (0) adds infinity to the LogLoss, since $-\log(0) = \infty$, which makes the total score infinite regardless of the other terms.
The metric is also symmetric, in the sense that predicting 0.1 for a false sample (0) carries the same penalty as predicting 0.9 for a positive sample (1). The value is bounded between zero and infinity, i.e. $\textrm{LogLoss} \in [0, \infty)$. The competition is a minimization problem: smaller metric values, $\textrm{LogLoss} \sim 0$, imply better prediction models. To avoid infinite values, the predictions are clipped to the range $[10^{-15}, 1-10^{-15}]$.
Here is an example from the competition. If the 'answer' file is:
```csv
id_label,pred
1_y1,1.0000
1_y2,0.0000
1_y3,0.0000
1_y4,0.0000
2_y1,0.0000
2_y2,1.0000
2_y3,0.0000
2_y4,1.0000
3_y1,0.0000
3_y2,0.0000
3_y3,1.0000
3_y4,0.0000
```
And the submission file is:
```csv
id_label,pred
1_y1,0.9000
1_y2,0.1000
1_y3,0.0000
1_y4,0.3000
2_y1,0.0300
2_y2,0.7000
2_y3,0.2000
2_y4,0.8500
3_y1,0.1900
3_y2,0.0000
3_y3,1.0000
3_y4,0.2700
```
the score is 0.1555 as shown by:
$$L = - \frac{1}{12} \left[ \log(0.9) + \log(1-0.1) + \log(1-0.0) + \log(1-0.3) + \log(1-0.03) + \log(0.7) + \log(1-0.2) + \log(0.85) + \log(1-0.19) + \log(1-0.0) + \log(1.0) + \log(1-0.27) \right] = 0.1555$$
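The hand computation above can be reproduced with a short NumPy sketch of the clipped LogLoss (the clipping to $[10^{-15}, 1-10^{-15}]$ is what keeps the $\log$ terms finite):

```python
import numpy as np

def log_loss_clipped(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy, with predictions clipped away from 0 and 1."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return float(np.mean(-y_true * np.log(p) - (1 - y_true) * np.log(1 - p)))

y_true = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0.90, 0.10, 0.00, 0.30, 0.03, 0.70, 0.20, 0.85, 0.19, 0.00, 1.00, 0.27]
print(round(log_loss_clipped(y_true, y_pred), 4))  # 0.1555
```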
In this section, I will analyze the training dataset of the competition, compute some descriptive statistics, and explore the features, trying to define the characteristics of this data.
%load_ext autoreload
%autoreload 2
import src.describe as d
import src.pre_processing as pre
import src.plots as pl
import pandas as pd
import numpy as np
import gc
import pickle
import os.path
from scipy import sparse
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', -1)
train_features = d.read_train_features()
train_labels = d.read_train_labels()
train_features.shape
train_features.head()
train_features.info()
train_labels.shape
train_labels.head()
train_labels.info()
In this section, we will categorize the columns to make manipulation easier, using the information the competition description gave us above. We'll store this metadata in a `meta` DataFrame:
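The helper `d.create_features_meta` lives in `src.describe` and is not shown here; a rough, hypothetical sketch of what such a metadata builder could look like (the exact category rules of the real helper may differ):

```python
import pandas as pd

# Hypothetical reconstruction: one row per column, recording its dtype,
# a rough category, and a role. The real rules in src.describe may differ.
def create_features_meta_sketch(df):
    rows = []
    for col in df.columns:
        dtype = str(df[col].dtype)
        values = df[col].dropna()
        if dtype == 'object' and values.isin(['YES', 'NO']).all():
            category = 'boolean'
        elif dtype == 'object':
            category = 'content'   # hashed text columns
        else:
            category = 'numerical'
        role = 'id' if col == 'id' else 'input'
        rows.append({'feature': col, 'category': category,
                     'dtype': dtype, 'role': role})
    return pd.DataFrame(rows).set_index('feature')
```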
meta = d.create_features_meta(train_features)
meta.head(10)
Extract all boolean features:
meta[meta.category == 'boolean'].index
See the number of features per category and dtype:
pd.DataFrame({'count' : meta.groupby(['category', 'dtype'])['dtype'].size()}).reset_index()
In this section we will apply the describe method on the features, split by category and dtype, to calculate the mean, standard deviation, max, min...
Numerical float variables
float_features = meta[(meta.category == 'numerical') & (meta.dtype == 'float64')].index
float_train_features = train_features[float_features]
float_train_features_describe = float_train_features.describe()
float_train_features_describe
float_train_features_describe.loc[['min','max']]
float_train_features.isnull().any().any()
The features that are scaled between [0,1] are: x6, x7, x21, x37, x38, x52, x67, x68, x82, x97, x98, x112, x122, x123, x137.
So we could apply scaling to the other features, depending on the classifier.
And we don't have any NaN values in these features.
Numerical int variables
int_features = meta[(meta.category == 'numerical') & (meta.dtype == 'int64') & (meta.role != 'id')].index
int_train_features = train_features[int_features]
int_train_features_describe = int_train_features.describe()
int_train_features_describe
int_train_features_describe.loc[['min','max']]
int_train_features.isnull().any().any()
None of the int numerical features are scaled, so depending on the algorithm we will have to scale them; there are no missing values. The problem here is that we don't know whether a given feature is categorical or quantitative.
Content variables
content_features = meta[(meta.category == 'content')].index
train_features[content_features] = train_features[content_features].astype(str)
train_features[content_features] = train_features[content_features].replace('nan', np.NaN)
content_train_features = train_features[content_features]
content_train_features_describe = content_train_features.describe()
content_train_features_describe
# flattening all the words to count them
all_words = pd.Series(content_train_features.values.flatten('F'))
all_words = all_words.to_frame().reset_index()
print('total words={}'.format(all_words.shape[0]))
all_words = all_words.rename(columns= {0: 'words'})
all_words = pd.DataFrame({'count' : all_words.groupby(['words'])['words'].size()}).reset_index()
all_words.sort_values('count', ascending=False).head(10)
all_words.sort_values('count', ascending=False).head(50000)['count'].sum()
content_train_features.isnull().sum()
Among the hashed words we have 979,749 unique words out of 17,000,000 total (1.7M rows x 10 columns), i.e. unique words are 5.76% of the total. This shows that the words can have a big impact on the classifier, because many words appear multiple times. But we have to take care of the NaN values and treat them.
Boolean variables
bool_vars = meta[(meta.category == 'boolean')].index
train_features[bool_vars].describe()
train_features[bool_vars].isnull().sum()
Among the boolean features, only 2 have no missing values. So we have to treat all these missing values here.
Labels variables
total = train_labels.shape[0]
for col in train_labels.columns:
    if col != 'id':
        print(train_labels[col].value_counts(sort=True))
        print('')
We only have two types of response in the labels, 0 and 1, making this a binary classification problem.
for col in train_labels.columns:
    if col != 'id':
        total_1 = total - train_labels[col].value_counts(sort=True)[0]
        perc = total_1 / total
        print('Column {} has {} positive labels, {:.2%} of total'.format(col, total_1, perc))
As we can see, most labels have few positive values, while the last label is positive in 55.96% of the rows.
train_labels.loc[:, train_labels.columns != 'id'].sum(axis=1).value_counts()
91.04% of text blocks have only one label, and the rest have more than one label.
Checking Missing Values
vars_with_missing = []
for f in train_features.columns:
    missings = train_features[f].isnull().sum()
    if missings > 0:
        vars_with_missing.append(f)
        missings_perc = missings/train_features.shape[0]
        category = meta.loc[f]['category']
        dtype = meta.loc[f]['dtype']
        print('Variable {} ({}, {}) has {} records ({:.2%}) with missing values'.format(f, category, dtype, missings, missings_perc))
print('In total, there are {} variables with missing values'.format(len(vars_with_missing)))
Several variables repeat the same count of missing values. This may be due to the relational text-block features: if a text block is the leftmost block in the document, it has no block to its left, and the same goes for the other directions.
Checking the cardinality of the int variables
Cardinality is the number of distinct values of a variable, so we will see which features should become dummy variables.
for f in int_train_features:
    dist_values = int_train_features[f].value_counts().shape[0]
    print('Variable {} has {} distinct values'.format(f, dist_values))
At this point I can't tell whether I will treat these variables as categorical and transform them into dummy variables, or treat them as quantitative variables.
In this section, we will explore the dataset visually, trying to summarize it and extract relevant characteristics about the data.
for v in int_features:
    feature = train_features[v]
    print("Variable {} has {} different values".format(v, feature.value_counts().shape[0]))
    print("Plotting only the 25 largest values")
    pl.categorical(train_features, v)
    print(feature.describe().apply(lambda x: format(x, 'f')))
    print('\n\n\n')
The numerical int variables show three different kinds of patterns: a group with the 25th percentile at 892, a group with the median at 1263, and the rest, which look like quantitative variables. I will classify the first as the height category, the second as the width category, and the third as the quantitative category.
meta.at[['x23', 'x54', 'x84', 'x114', 'x139'], 'category'] = 'height'
meta.at[['x22', 'x53', 'x83', 'x113', 'x138'], 'category'] = 'width'
meta.at[['x15', 'x17', 'x18', 'x27', 'x46', 'x48', 'x49', 'x58', 'x76', 'x78', 'x79', 'x88', 'x106', 'x108', 'x109', 'x118', 'x131', 'x133', 'x134', 'x143'], 'category'] = 'quantitative'
meta[(meta.category == 'width')]
pl.correlation_map(train_features, meta[(meta.category == 'height') | (meta.category == 'width')].index)
As I suspected, the width and height categories are strongly linearly correlated.
pl.correlation_map(train_features, meta[(meta.category == 'quantitative')].index)
pca_features = train_features.sample(frac=0.5)
pca = PCA(n_components=5)
pca.fit(pca_features[meta[(meta.category == 'height') | (meta.category == 'width')].index])
[ "{:0.5f}".format(x) for x in pca.explained_variance_ratio_ ]
As principal component analysis with 5 components can explain 99.21% of the variance in the height/width part of the dataset, this can be used in the implementation part of the algorithms.
pca_features = train_features.sample(frac=0.5)
pca = PCA(n_components=10)
pca.fit(pca_features[meta[(meta.category == 'quantitative')].index])
[ "{:0.5f}".format(x) for x in pca.explained_variance_ratio_ ]
As principal component analysis with 10 components can explain 92.06% of the variance in the quantitative part of the dataset, this can be used in the implementation part of the algorithms.
Checking the correlations between interval variables. A heatmap is a good way to visualize the correlation between variables.
pl.correlation_map(train_features, float_features)
As we can see, many features have a strong linear correlation, which makes this part of the dataset a good candidate for PCA, to see how much of that linear correlation the components can explain.
pca_features = train_features.sample(frac=0.5)
pca = PCA(n_components=10)
pca.fit(pca_features[(float_features)])
[ "{:0.5f}".format(x) for x in pca.explained_variance_ratio_ ]
As principal component analysis with 10 components can explain 91.99% of the variance in the float part of the dataset, this can be used in the implementation part of the algorithms.
for v in bool_vars:
    feature = train_features[v]
    pl.categorical(train_features, v)
    print(feature.describe())
The majority of the boolean variables have more NO than YES values; only in x126, x140, and x142 are the YES values more frequent. This visualization doesn't give us much information, as we don't know what any boolean feature means.
I intend to use three different algorithms to beat the benchmark: Random Forest Classifier, Linear SVC, and Multilayer Perceptron.
The first one, the Random Forest Classifier, is a bagging method: it builds multiple independent trees on subsets of the features and then averages the result of each tree. The average over all trees is a better estimator than a single tree because the variance is reduced, which reduces overfitting.
The second one is Linear SVC, a Support Vector Machine method, which can be effective in high-dimensional spaces, is memory efficient, and still works when the number of dimensions is greater than the number of samples.
The last one, the Multilayer Perceptron (a neural network), can learn non-linear functions, which can be advantageous in this project; on the other hand, this kind of model is sensitive to feature scaling, requires a lot of hyperparameter tuning, and can get stuck in a local minimum depending on the weight initialization.
In this competition, the organizers give us four different benchmark scores:
The scores of the four benchmarks are listed below; in this evaluation metric, lower is better:
| Benchmark | Score |
|---|---|
| TS Baseline | 0.0150548 |
| All Halves | 0.6931471 |
| Random | 1.0122952 |
| All Zeros | 1.1706929 |
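The "All Halves" score has a closed form: with every prediction at 0.5, each term of the sum is $-y\log(0.5) - (1-y)\log(0.5) = \log 2$ regardless of the label, so the score is exactly $\ln 2 \approx 0.6931471$:

```python
import math

# With every prediction at 0.5 the per-sample loss is log(2), whatever y is.
print(math.log(2))  # 0.6931471805599453 -- the "All Halves" benchmark
```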
# testing the score function
y_true = [1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000]
y_pred = [0.9000, 0.1000, 0.0000, 0.3000, 0.0300, 0.7000, 0.2000, 0.8500, 0.1900, 0.0000, 1.0000, 0.2700]
log_loss(y_true, y_pred)
#testing the score function on all halves benchmark
y_pred = np.full((1700000, 33), 0.5, dtype=np.float64).flatten()
y_true = np.array(train_labels.values)[:, 1:].astype(np.float64).flatten()
log_loss(y_true, y_pred)
In this part I will explore some transformations of the dataset for memory efficiency, like converting the YES/NO features to 0/1.
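The conversion is done by `pre.tranform_bool_df` further below; as a hypothetical sketch of what it might do, mapping the YES/NO strings to a small integer dtype (filling missing values with 0 is an assumption of this sketch, not necessarily what the real helper does):

```python
import numpy as np
import pandas as pd

def transform_bool_sketch(df, bool_cols):
    # Map YES/NO strings to int8; object columns cost far more memory.
    for col in bool_cols:
        df[col] = df[col].map({'YES': 1, 'NO': 0}).fillna(0).astype(np.int8)
    return df

toy = pd.DataFrame({'x10': ['YES', 'NO', None]})
toy = transform_bool_sketch(toy, ['x10'])
print(toy['x10'].tolist())  # [1, 0, 0]
```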
meta_path = "../working/meta.pkl"
if not os.path.isfile(meta_path):
    pickle.dump(meta, open(meta_path, "wb"))
meta = pickle.load(open(meta_path, "rb"))
int_features = meta[(meta.dtype == 'int64') & (meta.role != 'id')].index
float_features = meta[(meta.category == 'numerical') & (meta.dtype == 'float64')].index
bool_vars = meta[(meta.category == 'boolean')].index
content_features = meta[(meta.category == 'content')].index
As this dataset is too big for the memory I have available, I will work with only one-fifth of it.
ids_samples = [hash(i_id) % 5 == 0 for i_id in train_features['id']]
train_features = train_features[ids_samples]
train_features.shape
train_labels = train_labels[ids_samples]
train_labels.shape
# drop the empty label
train_labels.drop(labels=['y14'], axis="columns", inplace=True)
train_labels.shape
train_labels.to_pickle('../working/1_test_reduced.pkl')
train_features = pre.tranform_bool_df(train_features, bool_vars)
train_features.info()
train_features[bool_vars].isnull().any().any()
This transformation decreases memory usage to ~320 MB, compared to 1.8+ GB for the original dataset.
As the algorithms accept only floats, ints, and booleans as input, I decided to transform all the content features into boolean features, where each column represents a (feature, hash) pair and the value is 1 if the pair exists or 0 if not. As this has a high memory consumption, I will use a sparse column type when using get_dummies in pandas.
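A toy illustration (with made-up hashes) of this encoding, using `pd.get_dummies` with sparse columns; the real `pre.transform_content_dummy` may differ in details:

```python
import pandas as pd

# Each (column, hash) pair becomes its own 0/1 indicator, stored sparsely.
toy = pd.DataFrame({'x3': ['a1b2', 'ffee', 'a1b2'],
                    'x4': ['ffee', 'ffee', '0c0d']})
dummies = pd.get_dummies(toy, columns=['x3', 'x4'], sparse=True)
print(sorted(dummies.columns))  # ['x3_a1b2', 'x3_ffee', 'x4_0c0d', 'x4_ffee']
```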
train_features = pre.transform_content_dummy(train_features, content_features)
train_features = pre.transform_sparse(train_features, content_features)
vec = pd.read_pickle("../working/content_dicvector.pkl")
np.nan in vec.feature_names_
train_features
After the transformation the dataset goes from 10 content input columns to 510_290.
In this section, we will scale the dataset to get a better result from the PCA, as it needs the features on similar scales of measurement.
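The scaling helpers in `src.pre_processing` are not shown; presumably they wrap something like scikit-learn's `MinMaxScaler` (a stand-in assumption, the real helpers may use a different scaler), which puts each column on a comparable [0, 1] scale:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two columns with very different ranges end up on the same [0, 1] scale.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled[:, 1])  # [0.  0.5 1. ]
```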
train_features = pd.read_pickle("../working/3_train_numerical.pkl")
train_features = pre.transform_scale_float(train_features, float_features)
train_features[float_features].head()
train_features = pre.transform_scale_int(train_features, int_features)
train_features[int_features].head()
The previous PCA run showed that the width/height features have a lot of linear correlation, so we will apply PCA to these columns.
# wh -> width/height
wh_features = meta[(meta.category == 'height') | (meta.category == 'width')].index
train_features = pre.transform_pca(train_features, wh_features)
train_features.head()
Now, I will join the datasets together to feed the machine learning algorithms.
train_wh_pca_path = "../working/9_train_wh_pca.pkl"
numerical_train_features = pd.read_pickle(train_wh_pca_path)
# for the join, all the features have to be the same type
numerical_train_features = np.array(numerical_train_features).astype(np.float32)
train_only_content_encoding = "../working/4_train_only_content_encoding.pkl"
content_train_features = pd.read_pickle(train_only_content_encoding)
content_train_features = content_train_features.astype(np.float32)
content_train_features.data = np.nan_to_num(content_train_features.data)
content_train_features.shape
numerical_train_features.shape
train_features = sparse.csr_matrix(sparse.hstack([sparse.coo_matrix(numerical_train_features),sparse.coo_matrix(content_train_features)]))
train_features
pickle.dump(train_features, open('../working/11_train_pre_processed.pkl', "wb"))
Split the dataset into training and testing sets, saving them so the algorithms can be rerun.
train_labels = pd.read_pickle('../working/1_test_reduced.pkl')
train_features = pd.read_pickle('../working/11_train_pre_processed.pkl')
X_train, X_test, y_train, y_test = train_test_split(train_features, train_labels, test_size=0.33, random_state=42)
# drop id
y_train = np.array(y_train)[:,1:]
y_test = np.array(y_test)[:,1:]
x_train_path = "../working/12_train_x.pkl"
x_test_path = "../working/13_test_x.pkl"
y_train_path = "../working/14_train_y.pkl"
y_test_path = "../working/15_test_y.pkl"
pre.save_dataset(X_train, x_train_path)
pre.save_dataset(X_test, x_test_path)
pre.save_dataset(y_train, y_train_path)
pre.save_dataset(y_test, y_test_path)
First of all, I will try the algorithms with only the numerical features to see where we stand compared with the benchmark. The training dataset will also have to be split into two parts, for training and validation.
from sklearn.metrics import f1_score
train_numerical_path = "../working/3_train_numerical.pkl"
train_features = pd.read_pickle(train_numerical_path)
train_labels = pd.read_pickle('../working/1_test_reduced.pkl')
train_features= np.array(train_features.values).astype(np.float32)
train_labels = np.array(train_labels.values)[:, 1:].astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(train_features, train_labels, test_size=0.33, random_state=42)
The first technique to be applied is the Random Forest Classifier.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, verbose=1)
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
log_loss(y_test.flatten(), y_pred.flatten(), 1e-15)
# TS Baseline => 0.0150548
The second one is the Linear SVC.
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
lr = OneVsRestClassifier(LinearSVC(verbose=1))
lr = lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
log_loss(y_test.flatten(), y_pred.flatten(), 1e-15)
One problem here: the LinearSVC can't classify all labels at the same time, so we have to use the OneVsRest strategy with this algorithm.
The last one is a Multi Layer Perceptron.
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(128, 256, 64), max_iter=200, alpha=1e-4,
verbose=10, tol=1e-4, random_state=1)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
log_loss(y_test.flatten(), y_pred.flatten(), 1e-15)
# TS Baseline => 0.0150548
With the original numerical dataset the results are:
| Model | Score |
|---|---|
| RandomForestClassifier | 0.17026 |
| LinearSVC | 1.14282 |
| MLPClassifier | 0.44487 |
This shows the Random Forest may be the best classifier for the numeric part of the dataset.
Now, with the preprocessed dataset: scaled, and with PCA applied to the width/height columns.
train_numerical_path = "../working/9_train_wh_pca.pkl"
train_features = pd.read_pickle(train_numerical_path)
train_labels = pd.read_pickle('../working/1_test_reduced.pkl')
train_features= np.array(train_features.values).astype(np.float32)
train_labels = np.array(train_labels.values)[:, 1:].astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(train_features, train_labels, test_size=0.33, random_state=42)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
log_loss(y_test.flatten(), y_pred.flatten(), 1e-15)
# TS Baseline => 0.0150548
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
lr = OneVsRestClassifier(LinearSVC(verbose=1))
lr = lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
log_loss(y_test.flatten(), y_pred.flatten(), 1e-15)
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(128, 256, 64), max_iter=500, alpha=1e-4,
verbose=10, tol=1e-4, random_state=1)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
log_loss(y_test.flatten(), y_pred.flatten(), 1e-15)
With the scaled numerical dataset the results are:
| Model | Score | Original Dataset Score |
|---|---|---|
| RandomForestClassifier | 0.17170 | 0.17026 |
| LinearSVC | 0.63848 | 1.14282 |
| MLPClassifier | 0.25699 | 0.44487 |
The Random Forest stayed almost the same, but the other two algorithms scored almost half as much with scaling and PCA. The halved scores for the SVC and MLP are expected, as these are algorithms that benefit from this kind of preprocessing.
Now, with only the content features, to see how much the classifiers can do with them alone.
train_only_content_encoding = "../working/4_train_only_content_encoding.pkl"
train_features = pd.read_pickle(train_only_content_encoding)
train_features = train_features.astype(np.float32)
train_features.data = np.nan_to_num(train_features.data)
train_labels = pd.read_pickle('../working/1_test_reduced.pkl')
train_labels = np.array(train_labels.values)[:, 1:].astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(train_features, train_labels, test_size=0.33, random_state=42)
del train_features
del train_labels
gc.collect()
from sklearn.ensemble import RandomForestClassifier
clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=10, n_jobs=-1, verbose=1))
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
log_loss(y_test.flatten(), y_pred.flatten(), 1e-15)
# TS Baseline => 0.0150548
As the RandomForestClassifier with all the classes in one random forest was taking too long to train and the project deadline was approaching, I decided to use smaller random forests, one per class, via OneVsRestClassifier.
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
lr = OneVsRestClassifier(LinearSVC(verbose=1))
lr = lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
log_loss(y_test.flatten(), y_pred.flatten(), 1e-15)
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(128, 256, 64), max_iter=500, alpha=1e-4,
verbose=10, tol=1e-4, random_state=1)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
log_loss(y_test.flatten(), y_pred.flatten(), 1e-15)
Running the algorithms with only the content part gives:
| Model | Score | Scaled Dataset Score |
|---|---|---|
| RandomForestClassifier | 0.23358 | 0.17170 |
| LinearSVC | 0.19218 | 0.63848 |
| MLPClassifier | 0.20634 | 0.25699 |
The LinearSVC was the best algorithm for the content part of the problem and the fastest to train, but the MLPClassifier came close, and its hyperparameters could be tuned at the cost of more time.
x_train_path = "../working/12_train_x.pkl"
x_test_path = "../working/13_test_x.pkl"
y_train_path = "../working/14_train_y.pkl"
y_test_path = "../working/15_test_y.pkl"
X_train = pd.read_pickle(x_train_path)
X_test = pd.read_pickle(x_test_path)
y_train = pd.read_pickle(y_train_path)
y_test = pd.read_pickle(y_test_path)
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=10, n_jobs=-1, verbose=1))
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
log_loss(y_test.flatten(), y_pred.flatten(), 1e-15)
# TS Baseline => 0.0150548
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
lr = OneVsRestClassifier(LinearSVC(verbose=1))
lr = lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
log_loss(y_test.flatten(), y_pred.flatten(), 1e-15)
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(128, 256, 64), max_iter=500, alpha=1e-4,
verbose=10, tol=1e-4, random_state=1)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
log_loss(y_test.flatten(), y_pred.flatten(), 1e-15)
Running the algorithms with the scaled + content dataset gives:
| Model | Score | Scaled Dataset Score |
|---|---|---|
| RandomForestClassifier | 0.23732 | 0.17170 |
| LinearSVC | 0.16703 | 0.63848 |
| MLPClassifier | 0.15581 | 0.25699 |
The MLPClassifier was the best algorithm on the scaled + content dataset, but took much longer than the LinearSVC for a similar result.
During the implementation stage, I got the metric wrong and had to rerun all the algorithms. Besides that, running the classification with the full data sometimes raised a MemoryError, which cost me days and forced me to spin up a VM in the cloud to continue the project; add the time needed to fit the algorithms, and time became the limiting factor for experimenting more.
Using standard algorithms with 10% of the dataset, to fit in memory, doesn't give me a good score, so I have to try different techniques. One is to ensemble the best content-only model with the best model on the numerical features, combining them with a third model. Another is to apply the hashing trick and train the models on that.
In this section we will apply the hashing trick to all features.
from sklearn.feature_extraction import FeatureHasher
train_bool_transform_path = "../working/1_train_bool_transform.pkl"
train_features = pd.read_pickle(train_bool_transform_path)
hasher = FeatureHasher()
train_features = hasher.transform(train_features.T.to_dict().values())
train_features = train_features.astype(np.float32)
train_features.data = np.nan_to_num(train_features.data)
train_labels = pd.read_pickle('../working/1_test_reduced.pkl')
train_labels = np.array(train_labels.values)[:, 1:].astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(train_features, train_labels, test_size=0.33, random_state=42)
del train_features
del train_labels
gc.collect()
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=10, n_jobs=-1, verbose=1))
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
log_loss(y_test.flatten(), y_pred.flatten(), 1e-15)
# TS Baseline => 0.0150548
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
lr = OneVsRestClassifier(LinearSVC(verbose=1))
lr = lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
log_loss(y_test.flatten(), y_pred.flatten(), 1e-15)
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(256, 1024, 256), max_iter=500, alpha=1e-4,
verbose=10, tol=1e-4, random_state=1)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
log_loss(y_test.flatten(), y_pred.flatten(), 1e-15)
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
sgd = OneVsRestClassifier(SGDClassifier(loss='log', verbose=1, average=10000))
sgd = sgd.fit(X_train, y_train)
y_pred = sgd.predict(X_test)
log_loss(y_test.flatten(), y_pred.flatten(), 1e-15)
Running the algorithms with the hashing trick gives the following scores:
| Model | Score | Scaled + Content Dataset Score |
|---|---|---|
| RandomForestClassifier | 0.24398 | 0.23732 |
| LinearSVC | 3.40176 | 0.16703 |
| MLPClassifier | 1.08046 | 0.15581 |
| SGDClassifier | 1.04880 | - |
As we can see, the hashing trick didn't work well for any of the algorithms. I will keep investigating to choose the best approach to feed the full data. The next one is the ensemble technique.
I will use a Random Forest for the numerical part of the dataset and a LinearSVC for the content part, using the predictions of both as input to another Random Forest that makes the final prediction.
X_meta = []
X_meta_test = []
# loading the numerical dataset
train_numerical_path = "../working/9_train_wh_pca.pkl"
train_features = pd.read_pickle(train_numerical_path)
train_labels = pd.read_pickle('../working/1_test_reduced.pkl')
train_features= np.array(train_features.values).astype(np.float32)
train_labels = np.array(train_labels.values)[:, 1:].astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(train_features, train_labels, test_size=0.33, random_state=42)
del train_features
del train_labels
gc.collect()
# training the numerical meta
from sklearn.ensemble import RandomForestClassifier
rf = OneVsRestClassifier(RandomForestClassifier(n_estimators=10, verbose=1))
rf = rf.fit(X_train, y_train)
X_meta.append(rf.predict_proba(X_train))
X_meta_test.append(rf.predict_proba(X_test))
# loading the content dataset
train_only_content_encoding = "../working/4_train_only_content_encoding.pkl"
train_features = pd.read_pickle(train_only_content_encoding)
train_features = train_features.astype(np.float32)
train_features.data = np.nan_to_num(train_features.data)
train_labels = pd.read_pickle('../working/1_test_reduced.pkl')
train_labels = np.array(train_labels.values)[:, 1:].astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(train_features, train_labels, test_size=0.33, random_state=42)
del train_features
del train_labels
gc.collect()
# training the content meta
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
svc = OneVsRestClassifier(LinearSVC(verbose=1))
svc = svc.fit(X_train, y_train)
X_meta.append(svc.decision_function(X_train))
X_meta_test.append(svc.decision_function(X_test))
X_meta[0].shape
X_meta[1].shape
# bring together both meta
X_all_meta = np.column_stack(X_meta)
X_all_meta.shape
X_all_meta_test = np.column_stack(X_meta_test)
X_all_meta_test.shape
# training the new classifier
meta = OneVsRestClassifier(RandomForestClassifier(n_estimators=30, verbose=1))
meta = meta.fit(X_all_meta, y_train)
y_pred = meta.predict(X_all_meta_test)
log_loss(y_test.flatten(), y_pred.flatten(), 1e-15)
# cross-validating this classifier
from sklearn.metrics import log_loss, make_scorer
from sklearn.model_selection import cross_val_score
# note: the raw loss is reported here; for model selection proper,
# pass greater_is_better=False so that a lower loss counts as a better score
log_loss_scorer = make_scorer(log_loss, needs_proba=True)
rf_cv = OneVsRestClassifier(RandomForestClassifier(n_estimators=30, verbose=1, n_jobs=1))
scores = cross_val_score(rf_cv, X_all_meta, y_train, cv=4, n_jobs=1, scoring=log_loss_scorer)
print(np.mean(scores))
print(np.std(scores))
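The stacking scheme above can be condensed into a self-contained sketch: a random forest on the numerical features and a linear SVM on the content features feed a second-level random forest. All data and names below are synthetic and illustrative; only the wiring mirrors the real pipeline.

```python
# Minimal stacking sketch: two base learners produce meta-features
# that a second-level random forest is trained on.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_num = rng.rand(200, 5)                      # stands in for the PCA features
X_cont = rng.rand(200, 8)                     # stands in for the content encoding
y = (rng.rand(200, 3) > 0.5).astype(int)      # three binary labels

idx_tr, idx_te = train_test_split(np.arange(200), test_size=0.33, random_state=42)

base_rf = OneVsRestClassifier(RandomForestClassifier(n_estimators=10, random_state=0))
base_rf.fit(X_num[idx_tr], y[idx_tr])
base_svc = OneVsRestClassifier(LinearSVC())
base_svc.fit(X_cont[idx_tr], y[idx_tr])

# the base-model outputs become the meta-level features
Z_train = np.column_stack([base_rf.predict_proba(X_num[idx_tr]),
                           base_svc.decision_function(X_cont[idx_tr])])
Z_test = np.column_stack([base_rf.predict_proba(X_num[idx_te]),
                          base_svc.decision_function(X_cont[idx_te])])

stack_meta = OneVsRestClassifier(RandomForestClassifier(n_estimators=30, random_state=0))
stack_meta.fit(Z_train, y[idx_tr])
print(stack_meta.predict_proba(Z_test).shape)   # one probability per label per sample
```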
This ensemble classifier is the best result I could get with 10% of the dataset, so I will train it on 100% of the dataset and then try to improve it with hyperparameter tuning.
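One possible shape for that hyperparameter tuning is a randomized search over the meta-level random forest; the search space and data below are illustrative assumptions, not the settings actually used in this project.

```python
# Sketch of hyperparameter tuning for the meta random forest with
# RandomizedSearchCV. X_demo/y_demo are synthetic stand-ins for
# X_all_meta/y_train; f1_micro is used because it handles multilabel
# targets directly.
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(0)
X_demo = rng.rand(120, 6)                       # stands in for X_all_meta
y_demo = (rng.rand(120, 3) > 0.5).astype(int)   # stands in for y_train

param_dist = {
    "estimator__n_estimators": randint(10, 60),
    "estimator__max_depth": [None, 5, 10],
    "estimator__min_samples_leaf": randint(1, 8),
}
search = RandomizedSearchCV(
    OneVsRestClassifier(RandomForestClassifier(random_state=0)),
    param_distributions=param_dist,
    n_iter=5,
    scoring="f1_micro",
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```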
First, preprocessing the whole dataset, then training the ensemble model.
train_features = pre.transform_all()
train_labels = d.read_train_labels()
# drop y14, it is always 0
train_labels.drop(labels=['y14'], axis="columns", inplace=True)
train_labels = np.array(train_labels.values)[:, 1:].astype(np.float32)
X_train, X_test, X_content_train, X_content_test, y_train, y_test = \
train_test_split(train_features, content_df, train_labels, test_size=0.25, random_state=42)
del train_labels
del train_features
del content_df
del pca_wh
del float_scaler
del int_scaler
gc.collect()
pre.save_all(X_train, X_test, X_content_train, X_content_test, y_train, y_test)
X_train, X_test, X_content_train, X_content_test, y_train, y_test = pre.load_all()
print('X_train.shape={}'.format(X_train.shape))
print('X_test.shape={}'.format(X_test.shape))
print('X_content_train.shape={}'.format(X_content_train.shape))
print('X_content_test.shape={}'.format(X_content_test.shape))
print('y_train.shape={}'.format(y_train.shape))
print('y_test.shape={}'.format(y_test.shape))
Starting the training
X_meta = []
X_meta_test = []
# training the numerical meta
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
rf = OneVsRestClassifier(RandomForestClassifier(n_estimators=100, verbose=1, n_jobs=-1))
rf = rf.fit(X_train, y_train)
X_meta.append(rf.predict_proba(X_train))
X_meta_test.append(rf.predict_proba(X_test))
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
svc = OneVsRestClassifier(LinearSVC(verbose=1))
svc = svc.fit(X_content_train, y_train)
X_meta.append(svc.decision_function(X_content_train))
X_meta_test.append(svc.decision_function(X_content_test))
X_all_meta = np.column_stack(X_meta)
X_all_meta_test = np.column_stack(X_meta_test)
# training the new meta classifier
meta_clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=30, verbose=1, n_jobs=-1))
meta_clf = meta_clf.fit(X_all_meta, y_train)
y_pred = meta_clf.predict(X_all_meta_test)          # hard labels, used below for the classification report
y_proba = meta_clf.predict_proba(X_all_meta_test)   # probabilities, for the log loss
log_loss(y_test.flatten(), y_proba.flatten())
pickle.dump(rf, open('../working/25_base_rf.pkl', 'wb'))
pickle.dump(svc, open('../working/26_base_svc.pkl', 'wb'))
pickle.dump(meta_clf, open('../working/27_meta_rf.pkl', 'wb'))
# cross-validating this classifier
from sklearn.metrics import log_loss, make_scorer
from sklearn.model_selection import cross_val_score
log_loss_scorer = make_scorer(log_loss, needs_proba=True)
rf_cv = OneVsRestClassifier(RandomForestClassifier(n_estimators=30, verbose=1, n_jobs=-1))
scores = cross_val_score(rf_cv, X_all_meta, y_train, cv=4, n_jobs=1, scoring=log_loss_scorer)
print(np.mean(scores))
print(np.std(scores))
The final ensemble model was evaluated by cross-validation: a mean score of 0.20268 with a standard deviation of only 0.00031 shows that it is a robust model that can be trusted.
My final solution beat the Random Benchmark and the All Halves Benchmark but could not beat the TS Baseline Benchmark: my best model scored 0.09285, while the TS Baseline scored 0.0150548. My results are not strong enough to consider the problem solved, but I consider the TS Baseline to be at that stage already, since it comes from a production product.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
The classification report shows that, despite not beating the benchmark, the F1-scores for a good part of the labels are high enough to enrich the text-block data.
The process of participating in a Kaggle competition, from downloading the dataset and exploring it with visualizations to preprocessing the data, trying different algorithms, and iterating on the techniques, was an amazing experience and gave me a boost of confidence that someday I can be a machine learning engineer; thank you, Udacity, for this amazing course. I learned a lot along the way. One of the main lessons was that RAM can be a huge problem: a number of times I had to reboot my notebook because the dataset and its transformations filled up the memory and the swap, freezing the machine. Another interesting aspect is that the competition dataset has 140+ features, 33 labels, and 1.7 million rows, but no explanation of what each feature and label means, so I had to train a more generic algorithm to handle this. I learned a lot building this solution, which was my real expectation, because I knew it would be a hard task. The proposed solution can be applied to similar problems and be of great value.
There are techniques that could be applied to the proposed solution to get a better result, such as deriving relationships between the content features or using an online learning model. Other approaches I researched but did not implement are deep neural networks and embeddings for the content features. My solution came close to the benchmark, so the competition leaderboard surely holds better solutions than mine; and since this competition is four years old, deep learning techniques could set a new benchmark for this dataset.